Data: rental listings of aparments in NY City
Goal: predict level of interests
Some observations/questions raised during data exploration (would need to check in real-world use-case):
Is target normalised to the same listing period of length (e.g. number of days or weeks)? I suspect not as there is some correlation between listing id and interest level.
Use at least 2-year of historical data to build model to account for seasonal effects (e.g. school holidays, vacation)
Should also use out of time sampling and testing
Target distribution shows imbalanced classes.
rr table(d$interest_level)
high low medium
3839 34284 11229
Interest level is quantitative and we can encode the targets as numerical values and build a regression model. Future work.
rr sapply(d, class)
Train and test in the same periods of 3 months, so there could be information leak due to seasonal effects. Proportion of listings in each month is same in both train and test.
rr min(d$created)
[1] \2016-04-01 22:12:41\
rr max(d$created)
[1] \2016-06-29 21:41:47\
rr min(d_test$created)
[1] \2016-04-01 22:23:31\
rr max(d_test$created)
[1] \2016-06-29 21:55:35\
rr d <- mutate(d, month_created = month(as_datetime(created))) d_test <- mutate(d_test, month_created = month(as_datetime(created))) table(d$month_created)
4 5 6
16411 16626 16315
rr table(d_test$month_created)
4 5 6
24879 24987 24793
Listing IDs are identifiers so remove.
rr d <- select(d, -listing_id)
There are missing values for latitude and longitude (entered as 0), and outliers for prices. Leave them in for now as tree-based model can handle such cases
rr summary(select(d, bathrooms, bedrooms, latitude, longitude, price))
bathrooms bedrooms latitude longitude price
Min. : 0.000 Min. :0.000 Min. : 0.00 Min. :-118.27 Min. : 43
1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:40.73 1st Qu.: -73.99 1st Qu.: 2500
Median : 1.000 Median :1.000 Median :40.75 Median : -73.98 Median : 3150
Mean : 1.212 Mean :1.542 Mean :40.74 Mean : -73.96 Mean : 3830
3rd Qu.: 1.000 3rd Qu.:2.000 3rd Qu.:40.77 3rd Qu.: -73.95 3rd Qu.: 4100
Max. :10.000 Max. :8.000 Max. :44.88 Max. : 0.00 Max. :4490000
rr sum(d$latitude == 0)
[1] 12
rr sum(d$longitude == 0)
[1] 12
Intuitively we expect manager, buildings, and street address all to be predictive of the target. But they are categorical features with many values. Using them directly may lead to complex model that can overfit, so we try to convert them into vector of real values. However we must be careful to not leak information from train to validation during cross-validation, especially for managers / building with few listings. Use threshold to determinte when to use.
rr length(unique(d$building_id))
[1] 7585
rr length(unique(d$manager_id))
[1] 3481
rr length(unique(d$display_address))
[1] 8630
rr length(unique(d$street_address))
[1] 15358
In fact, we see that they all exhibit heavy tail distribution, meaning a top few instances have many listings and the majority have few listings.
rr inspect_categorical_var(d, ‘building_id’)
rr inspect_categorical_var(d, ‘manager_id’)
rr inspect_categorical_var(d, ‘display_address’)
Build features for each manager in terms of the interest levels to their properties. We see that managers vary in their listing performance, for example
rr qplot(x = mngr_prop_high, data=manager_features, geom=‘histogram’, binwidth=0.1)
rr qplot(x = mngr_prop_low, data=manager_features, geom=‘histogram’, binwidth=0.1)
We expect property location to contribute to interest level. This information is contained in lat and long, but we can also leverage display address to gauge the interest at street level. But to aggregate on street level, we need to see if there are enough houses on street.
rr street_features <- dplyr::count(d, display_address) summary(street_features$n)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 5.719 3.000 438.000
rr print(‘top 10% percentile street’)
[1] \top 10% percentile street\
rr quantile(street_features$n, 0.9)
90%
10
So 10% of streets have at least 10 houses, we may build features for these top streets.
We zoom in in the centre.
d_zoom <- d %>% filter(longitude > -74.5 & longitude < -73.5 & latitude > 40.5 & latitude < 41)
ggplotly(ggplot(d_zoom, aes(longitude, latitude)) + geom_jitter(aes(color = interest_level), alpha=0.5, size=0.2))
Looking at density on each dimension we see that there are relatively higher concentration of high interest properties near the edge.
rr ggplot(d_zoom, aes(x = latitude, color=interest_level)) + geom_density() ggplot(d_zoom, aes(x = longitude, color=interest_level)) + geom_density()
Can also look at medium and high only, but same information as prev plots.
rr ggplotly(filter(d_zoom, interest_level != ‘low’) %>% ggplot(aes(longitude, latitude)) + geom_jitter(aes(color = interest_level), alpha=0.5, size=0.5))
We augement features with manager, street, and building features.
Interest levels are ordinal, so convert to -2, 0, 2 and plot correlations. The newly features correlate much stronger to the target than original features.
Unknown variables: `NA`
Let’s build a first pass model using numerical features only. We can deal with text features later and ensemble the models.
Since data is very imbalanced, we should use xgboost with case weights
The CV error:
registerDoParallel(4)
getDoParWorkers()
[1] 4
params <- list(eta = 0.2, gamma = 0, max_depth = 6, min_child_weight = 2,
subsample = 1, colsample_bytree = 0.7,
num_class = 3, objective = "multi:softprob", eval_metric = "mlogloss")
xgb_d_train <- xgb.DMatrix(data.matrix(select(d_train, -interest_level)),
label=to_numeric(d_train$interest_level),
#weight=case_weights(d_train$interest_level),
missing=NA)
xgb_d_val <- xgb.DMatrix(as.matrix(select(d_val, -interest_level)),
label=to_numeric(d_val$interest_level), missing=NA)
xgb_cv <- xgb.cv(params, xgb_d_train, nfold = 5,
nrounds = 200, early.stop.round = 3, nthread = 4)
[0] train-mlogloss:0.975658+0.000430 test-mlogloss:0.978276+0.002567
[1] train-mlogloss:0.891308+0.001711 test-mlogloss:0.896319+0.003286
[2] train-mlogloss:0.826890+0.002722 test-mlogloss:0.834186+0.004270
[3] train-mlogloss:0.778011+0.002824 test-mlogloss:0.787518+0.005751
[4] train-mlogloss:0.740331+0.002795 test-mlogloss:0.751918+0.007003
[5] train-mlogloss:0.709880+0.003059 test-mlogloss:0.723541+0.008167
[6] train-mlogloss:0.684465+0.002456 test-mlogloss:0.700254+0.008787
[7] train-mlogloss:0.665196+0.001879 test-mlogloss:0.682704+0.009882
[8] train-mlogloss:0.648491+0.002415 test-mlogloss:0.667708+0.010396
[9] train-mlogloss:0.635054+0.002453 test-mlogloss:0.655909+0.010157
[10] train-mlogloss:0.623336+0.001712 test-mlogloss:0.645819+0.011107
[11] train-mlogloss:0.614134+0.001842 test-mlogloss:0.638026+0.011630
[12] train-mlogloss:0.605720+0.002608 test-mlogloss:0.631306+0.011715
[13] train-mlogloss:0.598375+0.002192 test-mlogloss:0.625359+0.011649
[14] train-mlogloss:0.591140+0.002253 test-mlogloss:0.619835+0.012330
[15] train-mlogloss:0.585345+0.001637 test-mlogloss:0.615354+0.012511
[16] train-mlogloss:0.579080+0.002636 test-mlogloss:0.610713+0.011939
[17] train-mlogloss:0.573491+0.002345 test-mlogloss:0.606533+0.012236
[18] train-mlogloss:0.568807+0.002532 test-mlogloss:0.603161+0.012168
[19] train-mlogloss:0.564464+0.002699 test-mlogloss:0.600194+0.012240
[20] train-mlogloss:0.560930+0.002898 test-mlogloss:0.597806+0.012083
[21] train-mlogloss:0.557446+0.003100 test-mlogloss:0.595803+0.012002
[22] train-mlogloss:0.554403+0.003009 test-mlogloss:0.593935+0.012145
[23] train-mlogloss:0.551447+0.003563 test-mlogloss:0.592240+0.011909
[24] train-mlogloss:0.548464+0.003755 test-mlogloss:0.590671+0.012105
[25] train-mlogloss:0.545571+0.004182 test-mlogloss:0.589393+0.011777
[26] train-mlogloss:0.543044+0.004346 test-mlogloss:0.588047+0.011581
[27] train-mlogloss:0.540234+0.003755 test-mlogloss:0.586630+0.011844
[28] train-mlogloss:0.537753+0.003585 test-mlogloss:0.585549+0.011968
[29] train-mlogloss:0.535315+0.003391 test-mlogloss:0.584541+0.012197
[30] train-mlogloss:0.532663+0.003374 test-mlogloss:0.583288+0.012317
[31] train-mlogloss:0.530812+0.003740 test-mlogloss:0.582530+0.012134
[32] train-mlogloss:0.528689+0.003750 test-mlogloss:0.581529+0.012394
[33] train-mlogloss:0.526245+0.003710 test-mlogloss:0.580697+0.012293
[34] train-mlogloss:0.524128+0.003750 test-mlogloss:0.579800+0.012347
[35] train-mlogloss:0.522219+0.003444 test-mlogloss:0.579116+0.012306
[36] train-mlogloss:0.520660+0.003472 test-mlogloss:0.578578+0.012306
[37] train-mlogloss:0.519154+0.003623 test-mlogloss:0.578059+0.012303
[38] train-mlogloss:0.517316+0.003468 test-mlogloss:0.577508+0.012354
[39] train-mlogloss:0.515643+0.003364 test-mlogloss:0.576867+0.012402
[40] train-mlogloss:0.514125+0.003555 test-mlogloss:0.576361+0.012395
[41] train-mlogloss:0.512855+0.003614 test-mlogloss:0.575968+0.012409
[42] train-mlogloss:0.510892+0.003478 test-mlogloss:0.575567+0.012591
[43] train-mlogloss:0.509356+0.003455 test-mlogloss:0.575082+0.012532
[44] train-mlogloss:0.508112+0.003257 test-mlogloss:0.574789+0.012644
[45] train-mlogloss:0.506126+0.003169 test-mlogloss:0.574399+0.012756
[46] train-mlogloss:0.504673+0.003423 test-mlogloss:0.573898+0.012839
[47] train-mlogloss:0.503133+0.003507 test-mlogloss:0.573637+0.012888
[48] train-mlogloss:0.501522+0.003603 test-mlogloss:0.573297+0.012886
[49] train-mlogloss:0.499806+0.003924 test-mlogloss:0.572882+0.012953
[50] train-mlogloss:0.498512+0.003668 test-mlogloss:0.572587+0.013059
[51] train-mlogloss:0.497459+0.003838 test-mlogloss:0.572391+0.013004
[52] train-mlogloss:0.495963+0.003772 test-mlogloss:0.571982+0.013247
[53] train-mlogloss:0.494604+0.004112 test-mlogloss:0.571800+0.013221
[54] train-mlogloss:0.493227+0.004354 test-mlogloss:0.571670+0.013281
[55] train-mlogloss:0.491718+0.004465 test-mlogloss:0.571347+0.013369
[56] train-mlogloss:0.490451+0.004008 test-mlogloss:0.571079+0.013434
[57] train-mlogloss:0.488950+0.003803 test-mlogloss:0.570800+0.013453
[58] train-mlogloss:0.487356+0.003738 test-mlogloss:0.570614+0.013497
[59] train-mlogloss:0.486000+0.003789 test-mlogloss:0.570397+0.013585
[60] train-mlogloss:0.484655+0.003775 test-mlogloss:0.570149+0.013511
[61] train-mlogloss:0.483656+0.003801 test-mlogloss:0.569981+0.013575
[62] train-mlogloss:0.482649+0.003641 test-mlogloss:0.569836+0.013569
[63] train-mlogloss:0.481344+0.003744 test-mlogloss:0.569783+0.013715
[64] train-mlogloss:0.479968+0.003893 test-mlogloss:0.569673+0.013719
[65] train-mlogloss:0.478737+0.003780 test-mlogloss:0.569546+0.013598
[66] train-mlogloss:0.477278+0.003814 test-mlogloss:0.569390+0.013506
[67] train-mlogloss:0.475840+0.003847 test-mlogloss:0.569206+0.013539
[68] train-mlogloss:0.474472+0.003706 test-mlogloss:0.568915+0.013648
[69] train-mlogloss:0.473138+0.003746 test-mlogloss:0.568809+0.013734
[70] train-mlogloss:0.472044+0.003487 test-mlogloss:0.568647+0.013801
[71] train-mlogloss:0.470815+0.003778 test-mlogloss:0.568539+0.013922
[72] train-mlogloss:0.469652+0.003934 test-mlogloss:0.568433+0.013879
[73] train-mlogloss:0.468603+0.004175 test-mlogloss:0.568296+0.013916
[74] train-mlogloss:0.467411+0.004046 test-mlogloss:0.568290+0.014004
[75] train-mlogloss:0.466149+0.003956 test-mlogloss:0.568114+0.013943
[76] train-mlogloss:0.465339+0.003658 test-mlogloss:0.568005+0.013991
[77] train-mlogloss:0.463978+0.003922 test-mlogloss:0.567910+0.014069
[78] train-mlogloss:0.462658+0.004264 test-mlogloss:0.567847+0.014041
[79] train-mlogloss:0.461700+0.004212 test-mlogloss:0.567769+0.014106
[80] train-mlogloss:0.460921+0.004252 test-mlogloss:0.567733+0.014124
[81] train-mlogloss:0.460093+0.004161 test-mlogloss:0.567685+0.014119
[82] train-mlogloss:0.459198+0.004137 test-mlogloss:0.567670+0.014064
[83] train-mlogloss:0.458149+0.004068 test-mlogloss:0.567595+0.014066
[84] train-mlogloss:0.457133+0.004007 test-mlogloss:0.567546+0.014109
[85] train-mlogloss:0.456097+0.003927 test-mlogloss:0.567575+0.014126
[86] train-mlogloss:0.455198+0.003818 test-mlogloss:0.567499+0.014133
[87] train-mlogloss:0.454256+0.003832 test-mlogloss:0.567522+0.014231
[88] train-mlogloss:0.453194+0.003850 test-mlogloss:0.567455+0.014160
[89] train-mlogloss:0.451930+0.003552 test-mlogloss:0.567428+0.014235
[90] train-mlogloss:0.450709+0.003518 test-mlogloss:0.567446+0.014219
[91] train-mlogloss:0.449615+0.003564 test-mlogloss:0.567445+0.014346
[92] train-mlogloss:0.448628+0.003476 test-mlogloss:0.567423+0.014340
[93] train-mlogloss:0.447457+0.003516 test-mlogloss:0.567432+0.014323
[94] train-mlogloss:0.446311+0.003508 test-mlogloss:0.567525+0.014353
[95] train-mlogloss:0.445222+0.003437 test-mlogloss:0.567527+0.014361
Stopping. Best iteration: 93
xgb_model <- xgb.train(params, xgb_d_train,
nrounds = 500, early.stop.round = 3, nthread = 4,
watchlist = list(val = xgb_d_val))
[0] val-mlogloss:0.979630
[1] val-mlogloss:0.898859
[2] val-mlogloss:0.840962
[3] val-mlogloss:0.798048
[4] val-mlogloss:0.761406
[5] val-mlogloss:0.732785
[6] val-mlogloss:0.711867
[7] val-mlogloss:0.693052
[8] val-mlogloss:0.679435
[9] val-mlogloss:0.669895
[10] val-mlogloss:0.658097
[11] val-mlogloss:0.651483
[12] val-mlogloss:0.643463
[13] val-mlogloss:0.635880
[14] val-mlogloss:0.631997
[15] val-mlogloss:0.627317
[16] val-mlogloss:0.624271
[17] val-mlogloss:0.620170
[18] val-mlogloss:0.617388
[19] val-mlogloss:0.615234
[20] val-mlogloss:0.614006
[21] val-mlogloss:0.609879
[22] val-mlogloss:0.608206
[23] val-mlogloss:0.606996
[24] val-mlogloss:0.605774
[25] val-mlogloss:0.604661
[26] val-mlogloss:0.603672
[27] val-mlogloss:0.602863
[28] val-mlogloss:0.601464
[29] val-mlogloss:0.600711
[30] val-mlogloss:0.599235
[31] val-mlogloss:0.598486
[32] val-mlogloss:0.597323
[33] val-mlogloss:0.596512
[34] val-mlogloss:0.595787
[35] val-mlogloss:0.595490
[36] val-mlogloss:0.594993
[37] val-mlogloss:0.594946
[38] val-mlogloss:0.594590
[39] val-mlogloss:0.594271
[40] val-mlogloss:0.593994
[41] val-mlogloss:0.593954
[42] val-mlogloss:0.593739
[43] val-mlogloss:0.593048
[44] val-mlogloss:0.592291
[45] val-mlogloss:0.591669
[46] val-mlogloss:0.591511
[47] val-mlogloss:0.591226
[48] val-mlogloss:0.590998
[49] val-mlogloss:0.590114
[50] val-mlogloss:0.589879
[51] val-mlogloss:0.589760
[52] val-mlogloss:0.589245
[53] val-mlogloss:0.589075
[54] val-mlogloss:0.588898
[55] val-mlogloss:0.588851
[56] val-mlogloss:0.588768
[57] val-mlogloss:0.588583
[58] val-mlogloss:0.588290
[59] val-mlogloss:0.588096
[60] val-mlogloss:0.588068
[61] val-mlogloss:0.587981
[62] val-mlogloss:0.587833
[63] val-mlogloss:0.587991
[64] val-mlogloss:0.587849
[65] val-mlogloss:0.588040
Stopping. Best iteration: 63
xgb_model
$handle
<pointer: 0x1309a2e70>
attr(,"class")
[1] "xgb.Booster.handle"
$raw
[1] 00 00 00 3f 1d 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[35] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[69] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[103] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[137] 0e 00 00 00 00 00 00 00 6d 75 6c 74 69 3a 73 6f 66 74 70 72 6f 62 06 00 00 00 00 00 00 00 67 62 74 72
[171] 65 65 c6 00 00 00 00 00 00 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00
[205] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[239] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[273] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[307] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 77 00 00 00
[341] 00 00 00 00 00 00 00 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[375] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[409] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[443] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[477] 00 00 00 00 ff ff ff ff 01 00 00 00 02 00 00 00 0e 00 00 00 34 47 14 3f 00 00 00 80 03 00 00 00 04 00
[511] 00 00 0e 00 00 00 6a 84 c5 3e 00 00 00 00 05 00 00 00 06 00 00 00 1a 00 00 80 d0 9b 39 3f 01 00 00 80
[545] 07 00 00 00 08 00 00 00 0e 00 00 00 ac b0 8a 3e 01 00 00 00 09 00 00 00 0a 00 00 00 04 00 00 80 00 d0
[579] 15 45 02 00 00 80 0b 00 00 00 0c 00 00 00 04 00 00 80 00 d0 fb 44 02 00 00 00 0d 00 00 00 0e 00 00 00
[613] 0d 00 00 80 62 a2 16 3e 03 00 00 80 0f 00 00 00 10 00 00 00 0e 00 00 00 1e 9f 3e 3e 03 00 00 00 11 00
[647] 00 00 12 00 00 00 04 00 00 80 00 0c cb 45 04 00 00 80 13 00 00 00 14 00 00 00 02 00 00 80 5a 33 23 42
[681] 04 00 00 00 15 00 00 00 16 00 00 00 1a 00 00 80 b5 20 34 3f 05 00 00 80 17 00 00 00 18 00 00 00 04 00
[715] 00 80 00 40 d3 44 05 00 00 00 19 00 00 00 1a 00 00 00 0e 00 00 80 0e 43 4a 3f 06 00 00 80 1b 00 00 00
[749] 1c 00 00 00 07 00 00 80 00 00 28 41 06 00 00 00 1d 00 00 00 1e 00 00 00 04 00 00 80 00 30 49 45 07 00
[783] 00 80 1f 00 00 00 20 00 00 00 07 00 00 80 00 00 78 41 07 00 00 00 21 00 00 00 22 00 00 00 19 00 00 80
[817] f2 15 3f 3e 08 00 00 80 23 00 00 00 24 00 00 00 02 00 00 80 dd 04 23 42 08 00 00 00 25 00 00 00 26 00
[851] 00 00 04 00 00 80 00 8c da 45 09 00 00 80 27 00 00 00 28 00 00 00 02 00 00 80 96 f4 22 42 09 00 00 00
[885] 29 00 00 00 2a 00 00 00 07 00 00 80 00 00 00 3f 0a 00 00 80 2b 00 00 00 2c 00 00 00 17 00 00 00 00 00
[919] 32 42 0a 00 00 00 2d 00 00 00 2e 00 00 00 11 00 00 00 00 00 bc 41 0b 00 00 80 2f 00 00 00 30 00 00 00
[953] 0d 00 00 00 00 00 08 3e 0b 00 00 00 31 00 00 00 32 00 00 00 0d 00 00 00 3c f2 68 3e 0c 00 00 80 33 00
[987] 00 00 34 00 00 00 04 00 00 80 00 98 3e 45
[ reached getOption("max.print") -- omitted 635740 entries ]
$bestScore
[1] 0.587833
$bestInd
[1] 63
attr(,"class")
[1] "xgb.Booster"
importance <- xgb.importance(feature_names = colnames(d_train), model = xgb_model)
xgb.plot.importance(importance)
#plot(varImp(xgb_model))
How well does the model predict?
val_prob <- matrix(predict(xgb_model, xgb_d_val), ncol = 3, byrow=T) %>% data.frame()
colnames(val_prob) <- c("low", "medium", "high")
val_prob <- data.frame(high = val_prob$high, low=val_prob$low, medium=val_prob$medium)
val_label <- probs_to_label(val_prob)$label
confusionMatrix(d_val$interest_level, val_label)
Confusion Matrix and Statistics
Reference
Prediction high low medium
high 224 267 468
low 43 7926 602
medium 149 1663 995
Overall Statistics
Accuracy : 0.7413
95% CI : (0.7334, 0.749)
No Information Rate : 0.7989
P-Value [Acc > NIR] : 1
Kappa : 0.36
Mcnemar's Test P-Value : <2e-16
Statistics by Class:
Class: high Class: low Class: medium
Sensitivity 0.53846 0.8042 0.48184
Specificity 0.93834 0.7400 0.82360
Pos Pred Value 0.23358 0.9247 0.35447
Neg Pred Value 0.98313 0.4875 0.88772
Prevalence 0.03372 0.7989 0.16738
Detection Rate 0.01816 0.6425 0.08065
Detection Prevalence 0.07773 0.6947 0.22753
Balanced Accuracy 0.73840 0.7721 0.65272
sprintf("log loss = %.4f", logloss(d_val$interest_level, val_prob))
[1] "log loss = 0.5880"
Model is predicting high & low better than medium, even though high has much lower prevalence (only 4%). But this makes sense as we found in our prior analysis that there are few features that distinguish high from low and medium. Might perform even better with one vs. all.